ABSTRACT

We propose a novel vector aggregation technique for compact
video representation, with application in accurate similarity
detection within large video datasets. The current state of
the art in visual search is formed by the vector of locally
aggregated descriptors (VLAD) of Jégou et al. VLAD generates
compact video representations based on scale-invariant
feature transform (SIFT) vectors (extracted per frame) and
local feature centers computed over a training set. With the
aim of increasing robustness to visual distortions, we propose
a new approach that operates at a coarser level in the feature
representation. We create vectors of locally aggregated
centers (VLAC) by first clustering SIFT features to obtain
local feature centers (LFCs) and then encoding the latter with
respect to given centers of local feature centers (CLFCs),
extracted from a training set. The sums of differences between
the LFCs and the CLFCs are aggregated to generate an
extremely compact video description used for accurate video
segment similarity detection. Experimentation using a video
dataset, comprising more than 1000 minutes of content from
the Open Video Project, shows that VLAC obtains substantial
gains in terms of mean Average Precision (mAP) against
VLAD and the hyper-pooling method of Douze et al., under
the same compaction factor and the same set of distortions.

Index Terms — video similarity, vector of locally aggregated
descriptors, scale-invariant feature transform

1. INTRODUCTION 

Recommendation services, event detection, clustering and
categorization of video data, and retrieval algorithms for
large video databases depend on efficient and reliable
similarity identification amongst video segments. In
a nutshell, given a query video, we wish to find all similar
video segments within a large video database in the most
reliable and efficient way. The state-of-the-art in similarity
identification hinges on video fingerprinting algorithms.
The aim of such algorithms is to provide distinguishable
representations that remain robust under visual distortions
such as rotation, compression, blur, resizing and flicker.
Such distortions are expected to be present within large video
collections, or when dealing with content “in the wild”.

In a broad sense, video similarity identification can be
seen as a spatio-temporal matching problem via an appropriate
feature space or descriptor. Recent results have shown that
similarity identification algorithms based on local descriptors,
such as the scale-invariant feature transform (SIFT) [6] or
dense SIFT [7], tend to significantly outperform previous
approaches based on histogram methods or fingerprinting
algorithms, especially in the presence of distortions in
the video data. Therefore, the state-of-the-art in this area is
based on vectors of locally aggregated descriptors (VLAD)
[10], or Bag-of-Words (BoW) methods [11], which merge
feature descriptors in video frame regions. More recently,
hyper-pooling approaches have been proposed [5], which
perform two consecutive VLAD stages in order to compact entire
video sequences into a unique aggregated descriptor vector.

In this paper, we focus on VLAD-based algorithms and 
examine the problem of creating compact representations that 
are suitable for efficient and accurate similarity identification 
of segments of videos within a large video collection. The 
paper makes the following contributions: 

• Instead of creating holistic hyper-pooling approaches 
for entire video sequences, we concentrate on groups 
of frames (GoFs) within a video sequence in order to 
allow for video segment search. 

• Instead of directly compacting feature descriptors, we
follow a two-stage clustering approach: we first cluster
features to obtain local-feature-centers (LFCs) and
then encode the latter with respect to a given set of
centers of local-feature-centers (CLFCs), computed from a
training set.

• Similar to VLAD, we encode the LFCs by aggregating
their differences with respect to their corresponding
CLFCs, thereby creating vectors of locally aggregated
centers (VLAC).

• Experiments using a 100-minute training set and a
1000-minute test set from the Open Video Project reveal
that, for the same compaction factor, our proposal
outperforms the state-of-the-art VLAD method
[10] by more than 15% in terms of mean Average
Precision (mAP).

The remainder of the paper is organized as follows. Section 2
summarizes the operation of VLAD and hyper-pooling, which
constitute the state-of-the-art and form the basis of the
proposed compaction algorithm. Section 3 presents the proposed
VLAC approach. Section 4 presents experimental results,
while Section 5 draws concluding remarks.

2. BACKGROUND ON VLAD AND 
HYPER-POOLING 

2.1. Visual Feature Description 

Current solutions make use of image descriptors to represent
individual frames within a video. After extracting the
local feature descriptors of a given set of frames using an
algorithm such as SIFT [6] or dense SIFT [7], these
descriptors are then accumulated to produce a compact frame
representation. Recent work advocated the use of pooling
strategies instead of simple averaging methods, in order to
minimize information loss. A common way to achieve this
is by using BoW methods [11] or VLAD [10]. In this paper,
we focus on the latter as it has been shown to achieve
state-of-the-art results in terms of mAP in medium and large-scale
sets of image and video content.

2.2. Vector of Locally Aggregated Descriptors 

VLAD [1], [10] is a vector aggregation algorithm that produces
a fixed-size compact description of a set comprising a variable
number of data points. VLAD was proposed as a novel
approach aimed at optimizing: (i) the representation of
aggregated feature descriptors; (ii) the dimensionality reduction;
(iii) the indexing of the output vectors.

These aspects are interrelated; for example, dimensionality
reduction directly affects the way we index the output
vectors. While high-dimensional vectors produce more accurate
search results, low-dimensional vectors are easier to index
and require fewer operations and less storage.

Consider a set of $W$ video frames to be used for training
purposes. For the $w$th training frame, $1 \leq w \leq W$, a
visual feature detector and descriptor (e.g., the SIFT detector
and descriptor [6]) is calculated, thereby producing feature
vectors $\mathbf{f}_{w,k}$, $1 \leq k \leq K_w$, each with dimension $1 \times F$.
The ensemble of these features comprises the $w$th training
frame's set of visual features $\mathcal{F}_w = \{\mathbf{f}_{w,1}, \mathbf{f}_{w,2}, \ldots, \mathbf{f}_{w,K_w}\}$.
The concatenation of all these sets for all $W$ training frames,
given by $\mathcal{F}_{\text{train}} = \{\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_W\}$, undergoes a clustering
approach, such as K-means [12], thereby grouping all
vectors in $\mathcal{F}_{\text{train}}$ into $J$ clusters, with centers denoted by set
$\mathcal{C}_{\text{train}} = \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_J\}$. VLAD then encodes the set of
visual features, $\mathcal{F}_w$, of the $w$th frame as the group of
$F$-dimensional vectors $\mathbf{v}_{w,j}$ ($1 \leq j \leq J$) given by

$$\mathbf{v}_{w,j} = \sum_{\forall k:\, Q(\mathbf{f}_{w,k}) = \mathbf{c}_j} \left(\mathbf{f}_{w,k} - \mathbf{c}_j\right) \tag{1}$$

where $Q(\mathbf{f}_{w,k})$ is the quantization function that determines
which cluster $\mathbf{f}_{w,k}$ belongs to. Then, the VLAD of the $w$th
frame is given by the vector of aggregated local differences
$\mathbf{v}_w = [\mathbf{v}_{w,1} \cdots \mathbf{v}_{w,J}]$, with dimension $1 \times JF$. All these
vectors are concatenated into the $W \times JF$-dimensional matrix
$\mathbf{V}_{\text{train}} = [\mathbf{v}_1^{\top} \cdots \mathbf{v}_W^{\top}]^{\top}$, which comprises the VLAD
encoding of the training set. In order to allow for further
dimensionality reduction (thereby accelerating the matching
process), principal component analysis (PCA) is applied to
$\mathbf{V}_{\text{train}}$, and the $D$ most dominant eigenvectors are maintained
in the $D \times JF$ matrix $\mathbf{P}_{\text{train}}$ in order to be used in the test set.
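The per-frame encoding of Eq. (1) can be illustrated with a minimal NumPy sketch. It assumes that the quantization function Q assigns each feature to its nearest center in Euclidean distance (the usual K-means convention, not stated explicitly in the text); the function and variable names are illustrative only.

```python
import numpy as np

def vlad_encode(features, centers):
    """Aggregate residuals of features to their nearest centers (Eq. (1)).

    features: (K, F) array of local descriptors for one frame.
    centers:  (J, F) array of trained cluster centers.
    Returns a (J * F,) vector [v_1 ... v_J].
    """
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every feature to every center: (K, J).
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Assumed Q(f_k): index of the nearest center for each feature.
    assignment = dists.argmin(axis=1)
    num_centers, dim = centers.shape
    vlad = np.zeros((num_centers, dim))
    for j in range(num_centers):
        members = features[assignment == j]
        if len(members) > 0:
            # Sum of differences between member features and center j.
            vlad[j] = (members - centers[j]).sum(axis=0)
    return vlad.reshape(-1)

# Toy example with F = 2, J = 2 and K = 3.
toy_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_features = np.array([[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]])
toy_vlad = vlad_encode(toy_features, toy_centers)
```

In the toy example, the first two features fall in the first cluster and the third in the second, so the residual sums are (1, 2) and (-1, 0).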

When considering a test video frame, once its set of visual
features $\mathcal{F}_{\text{test}} = \{\mathbf{f}_{\text{test},1}, \mathbf{f}_{\text{test},2}, \ldots, \mathbf{f}_{\text{test},K_{\text{test}}}\}$ is produced by the
SIFT descriptor (assuming $K_{\text{test}}$ points were detected), VLAD
performs the following steps: (i) calculation of $\mathbf{v}_{\text{test},j}$ ($1 \leq j \leq J$)
via (1) with the precalculated center set $\mathcal{C}_{\text{train}}$; (ii) aggregation
of these into a $JF \times 1$ composite vector and application of
dimensionality reduction via the retained PCA coefficients in
$\mathbf{P}_{\text{train}}$:

$$\mathbf{v}_{\text{test}} = \mathbf{P}_{\text{train}} \left[\mathbf{v}_{\text{test},1} \cdots \mathbf{v}_{\text{test},J}\right]^{\top}, \tag{2}$$

where $\mathbf{v}_{\text{test}}$ denotes the $D \times 1$ VLAD of the test video frame
after compaction with PCA. The similarity between two VLAD
vectors of two test video frames $t_1$ and $t_2$ is simply measured
via $s_{t_1,t_2} = \langle \mathbf{v}_{t_1}, \mathbf{v}_{t_2} \rangle$. Thresholding the set of similarity (i.e.,
inner product) results between a test video frame and the
entire test set of video frames provides the list of similar frames
retrieved under the selected threshold value.
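The compaction and matching steps can be sketched as follows. This is an illustration under stated assumptions: the principal directions are estimated from mean-centered training vectors via an SVD (one common way to carry out PCA; the text does not say how it was computed), while the projection itself follows Eq. (2) as written, without mean subtraction. All function names are hypothetical.

```python
import numpy as np

def fit_pca_projection(train_vlads, num_components):
    """Return a (D, JF) matrix holding the D dominant principal directions."""
    train_vlads = np.asarray(train_vlads, dtype=float)
    centered = train_vlads - train_vlads.mean(axis=0)
    # Rows of vt are eigenvectors of the covariance matrix, ordered by
    # decreasing singular value (hence decreasing eigenvalue).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_components]

def project(p_train, vlad):
    """Eq. (2): compact a JF-dimensional VLAD to D dimensions."""
    return p_train @ vlad

def similar_frames(query, database, threshold):
    """Indices of database rows whose inner product with the query exceeds the threshold."""
    scores = database @ query
    return np.flatnonzero(scores > threshold)
```

A query frame would be encoded, projected with `project`, and compared against the projected database frames with `similar_frames`; the threshold trades precision against recall.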

2.3. Hyper-Pooling 

A recent method proposed by Douze et al. [5] makes use of
hyper-pooling (HP) strategies on the video description level.
Hyper-pooling works by using a second layer of data clustering
and encoding a set of frame VLAD descriptors into
a single vector. Hyper-pooling utilizes an enhanced hashing
scheme by exploiting the temporal variance properties of
VLAD vectors [5] that have been produced per frame. After
performing PCA, the temporal variance of VLAD vectors
is most prominent in the components associated with low
eigenvalues. Hence, hyper-pooling postulates that we can get
a more stable set of centers by applying a clustering algorithm
(such as K-means) on the set of components relating to
the highest eigenvalues. Indeed, hashing the components that
vary less with time has been shown to provide better results
in terms of stability and robustness to noise [5].


2.4. Motivation Behind the Proposed Concept 

From the previous description, it is evident that the crucial
aspects of VLAD and hyper-pooling are the clustering and
the PCA process performed on the training set. Ideally, for
a given set of video frames, we would like to produce principal
component vectors for compaction of VLADs that do
not change substantially when the video frames undergo
real-world visual distortions. For example, consider two ensembles
of training video frame sets, $\mathcal{I}_{\text{clean}}$ and $\mathcal{I}_{\text{noisy}}$, with the
latter produced by distorting the video frames in $\mathcal{I}_{\text{clean}}$ via
blurring, compression artifacts, rotation, gamma changes, etc.
During the training stage, applying PCA on the vectors of
local differences (obtained per frame) will produce $D$ dominant
eigenvectors forming the $D \times JF$ matrices $\mathbf{P}_{\text{train},\text{clean}}$ and
$\mathbf{P}_{\text{train},\text{noisy}}$. In the case of hyper-pooling, the aforementioned
matrices will have a dimension of $D \times JD_0$, where $D_0$ is the number
of dimensions retained after the first VLAD stage. Ideally,
the vectors in $\mathbf{P}_{\text{train},\text{clean}}$ and $\mathbf{P}_{\text{train},\text{noisy}}$ should be reasonably
well aligned, which is an indication that the compaction process
is robust to noise. This can be tested by computing the
sum of inner products between the $D$ dominant eigenvectors
of both cases. For both VLAD and hyper-pooling, we obtain


$$s_{\{\text{VLAD},\text{HP}\},\text{clean},\text{noisy}} = \sum_{i=1}^{D} \sum_{j=1}^{\{JF,\,JD_0\}} p_{\text{train},\text{clean}}[i,j]\; p_{\text{train},\text{noisy}}[i,j] \tag{3}$$


where $p[i,j]$ denotes the $(i,j)$ element of $\mathbf{P}$. We carried out
such an indicative test on a set of $Z = 2000$ video frames
taken from 10 video clips of 10-minute duration each. Each
video underwent seven different visual distortions, as tabulated
in Table 1 and detailed in Section 4. Using $J = 128$
clusters for VLAD and $F = 128$ for dense SIFT, we obtain
$s_{\text{VLAD},\text{clean},\text{noisy}} = 0.0085$ and $s_{\text{HP},\text{clean},\text{noisy}} = 0.0445$.
However, utilizing the SIFT vectors directly, performing PCA
decomposition to produce the two $D \times F$ matrices $\mathbf{P}_{\text{SIFT},\text{train},\text{clean}}$
and $\mathbf{P}_{\text{SIFT},\text{train},\text{noisy}}$, and computing

$$s_{\text{SIFT},\text{clean},\text{noisy}} = \sum_{i=1}^{D} \sum_{j=1}^{F} p_{\text{SIFT},\text{train},\text{clean}}[i,j]\; p_{\text{SIFT},\text{train},\text{noisy}}[i,j] \tag{4}$$

we get $s_{\text{SIFT},\text{clean},\text{noisy}} = 0.996$. The significant difference between
$s_{\text{SIFT},\text{clean},\text{noisy}}$ on the one hand and $s_{\text{VLAD},\text{clean},\text{noisy}}$ and
$s_{\text{HP},\text{clean},\text{noisy}}$ on the other represents the reduction in tolerance
to distortions incurred when the vectors are projected to their
principal components, which
is performed in order to gain the benefit of compaction.
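The alignment measure of Eqs. (3) and (4) is a sum of element-wise products of two equally sized projection matrices. A minimal sketch, with hypothetical names, is given below; it implements the double sum exactly as written and applies no normalization.

```python
import numpy as np

def alignment_score(p_clean, p_noisy):
    """Sum over i, j of p_clean[i, j] * p_noisy[i, j] for two (D, dim) matrices."""
    p_clean = np.asarray(p_clean, dtype=float)
    p_noisy = np.asarray(p_noisy, dtype=float)
    if p_clean.shape != p_noisy.shape:
        raise ValueError("projection matrices must have the same shape")
    return float(np.sum(p_clean * p_noisy))
```

Two caveats follow from this form. First, the score of a matrix with unit-norm rows against itself equals D, so any normalization behind the reported values is not recoverable from the equations alone. Second, eigenvectors are defined only up to sign, so a practical comparison would need a consistent sign convention for the rows of both matrices.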

In this paper, our aim is to design a method leading to
the same compaction factor as VLAD, albeit having increased
tolerance to distortion in the video frames, which will allow
for high recall rates even when dealing with distorted versions
of the input video content. A secondary aim is to design our
approach in a way that directly deals with video segments
rather than individual video frames, thus allowing for video
segment similarity detection. These two aspects are elaborated
in the next section.

3. VECTOR OF LOCALLY AGGREGATED 
CENTERS 

3.1. VLAD per Video Frame 

The similarity between two videos can be estimated by
obtaining the VLAD inner products per frame and averaging.
We consider this approach as the baseline for video similarity
detection. This direct application of VLAD to video achieves
good results in terms of retrieval accuracy, albeit at the
expense of high complexity and storage requirements, even
when the video is sampled at a substantially lower frame
rate. All the solutions proposed are designed to approach the
performance of this baseline as closely as possible while
requiring a fraction of its computational complexity and storage,
or, alternatively, to significantly exceed the VLAD performance
while incurring the same complexity and storage.
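As a rough illustration of this baseline, the sketch below averages frame-wise inner products between two videos. It assumes that both videos have the same number of sampled frames and that frames are already temporally aligned, which the text does not specify; the function name is hypothetical.

```python
import numpy as np

def per_frame_baseline(video_a, video_b):
    """Average of frame-wise inner products between two (num_frames, D) arrays."""
    video_a = np.asarray(video_a, dtype=float)
    video_b = np.asarray(video_b, dtype=float)
    if video_a.shape != video_b.shape:
        raise ValueError("this sketch assumes equally long, frame-aligned videos")
    # Row-wise inner products, then their mean over all frames.
    return float(np.mean(np.sum(video_a * video_b, axis=1)))
```

The storage cost of this baseline grows with one D-dimensional vector per sampled frame, which is the overhead the GoF-based descriptors aim to reduce.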

3.2. Temporal Compaction for Video Segment Searching 

Video description algorithms such as hyper-pooling [5] were
designed for holistic video description, namely, the derived
vector describes the entire video information as a whole.
Temporal coherency is lost when using such holistic description
methods, thereby making the detection of video segments
within longer videos impossible. This problem can be solved
by modifying holistic solutions to work on groups of frames
(GoFs) within each given video. GoFs can be viewed as fixed-size
temporal windows, each of which is then compacted into
a single VLAD, hyper-pooling or VLAC descriptor (referred
to as VLAD-GoF, HP-GoF and VLAC-GoF, respectively).
A video segment can then be matched by finding the maximum
inner product between its VLAD-GoF, HP-GoF, or VLAC-GoF
descriptor and the corresponding descriptor from a
GoF in the video. Evidently, the length of the GoF controls
the accuracy of the detection of video segments within longer
videos. In addition, GoFs can also be overlapping to allow
for better temporal resolution within the matching process.
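The windowing into GoFs can be sketched in a few lines of Python. The handling of trailing frames that do not fill a complete window is an assumption here (they are dropped); the function name and arguments are illustrative.

```python
def split_into_gofs(num_frames, gof_size, overlap):
    """Return (start, end) frame-index pairs of fixed-size windows, end exclusive."""
    if not 0 <= overlap < gof_size:
        raise ValueError("overlap must satisfy 0 <= overlap < gof_size")
    # Consecutive windows share `overlap` frames, so each starts this far apart.
    step = gof_size - overlap
    return [(start, start + gof_size)
            for start in range(0, num_frames - gof_size + 1, step)]

# A GoF of 5 frames with an overlap of 1 frame over 13 sampled frames.
windows = split_into_gofs(num_frames=13, gof_size=5, overlap=1)
```

With these settings the windows cover frames 0 to 4, 4 to 8 and 8 to 12, so adjacent GoFs share exactly one frame.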

3.3. Proposed Vector of Locally Aggregated Centers 

Instead of clustering the local descriptors found within each
GoF, we propose to cluster the centers of clusters of local
descriptors. The aim is to produce results that are increasingly
robust to distortions that may be found in a typical large video
database. Encoding centers is expected to be more robust to
such visual distortions since, compared to local feature
descriptors, the centers of local feature descriptors will vary less
when artifacts from processing are incurred on video frames.

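Consider $T$ training GoFs stemming from a set of training
videos. From the frames of each $\tau$th GoF ($\tau \in [1, \ldots, T]$),
we extract a set of $K_\tau$ dense SIFT feature vectors
$\mathcal{F}_\tau = \{\mathbf{f}_{\tau,1}, \ldots, \mathbf{f}_{\tau,K_\tau}\}$, each having $F$ dimensions. From each
$\mathcal{F}_\tau$, we calculate $N$ local feature centers (LFCs)
$\mathcal{C}_\tau = \{\mathbf{c}_{\tau,1}, \ldots, \mathbf{c}_{\tau,N}\}$. By concatenating the LFCs for each $\tau$,
we acquire the training set of LFCs $\mathcal{C}_{\text{train}} = \{\mathcal{C}_1, \ldots, \mathcal{C}_T\}$.
We then apply a second stage of clustering on $\mathcal{C}_{\text{train}}$ to
generate a set of $M$ centers of LFCs (CLFCs)
$\mathcal{C}_{\text{enc}} = \{\mathbf{c}_{\text{enc},1}, \ldots, \mathbf{c}_{\text{enc},M}\}$, where each CLFC has $F$ dimensions.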
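The two-stage training procedure can be sketched as follows. The clustering routine is a plain Lloyd's K-means written for illustration (with seeded random initialization and a fixed iteration count as assumptions); the text only requires some clustering approach at each stage, and the names `kmeans` and `train_clfcs` are hypothetical.

```python
import numpy as np

def kmeans(points, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a (num_clusters, F) array of centers."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centers with distinct, randomly chosen data points.
    centers = points[rng.choice(len(points), num_clusters, replace=False)]
    for _ in range(num_iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(num_clusters):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def train_clfcs(gof_features, num_lfcs, num_clfcs):
    """Two-stage clustering: features -> N LFCs per GoF -> M CLFCs overall.

    gof_features: list of T arrays, each (K_tau, F), one per training GoF.
    """
    lfcs = [kmeans(features, num_lfcs) for features in gof_features]
    c_train = np.vstack(lfcs)           # training set of LFCs, (T * N, F)
    return kmeans(c_train, num_clfcs)   # CLFCs, (M, F)
```

Note that the second stage only ever sees T * N center vectors rather than all raw descriptors, so it operates on a much smaller and smoother point set than the single-stage clustering used by VLAD.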

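We now consider a test video query $u_1$ that contains $G$
GoFs. For every $g \in [1, \ldots, G]$, we extract $K_g$ local features
to obtain $\mathcal{F}_g = \{\mathbf{f}_{g,1}, \ldots, \mathbf{f}_{g,K_g}\}$. Then, for every $\mathcal{F}_g$, we
obtain a set of $N$ local feature centers $\mathcal{C}_g = \{\mathbf{c}_{g,1}, \ldots, \mathbf{c}_{g,N}\}$.
Using VLAD, we encode each set of centers $\mathcal{C}_g$ with the set
of trained centers $\mathcal{C}_{\text{enc}}$ to generate a vector of locally
aggregated centers (VLAC). Particularly, we first obtain the
$F$-dimensional vector $\mathbf{v}_{g,m}$ for each center $\mathbf{c}_{\text{enc},m}$ in $\mathcal{C}_{\text{enc}}$ by
applying

$$\mathbf{v}_{g,m} = \sum_{\forall n:\, Q(\mathbf{c}_{g,n}) = \mathbf{c}_{\text{enc},m}} \left(\mathbf{c}_{g,n} - \mathbf{c}_{\text{enc},m}\right). \tag{5}$$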


The VLAC for $g$ is then obtained by concatenating $\mathbf{v}_{g,m}$ for
all $m \in [1, 2, \ldots, M]$ into a single $1 \times MF$-dimensional
vector $\mathbf{v}_g = [\mathbf{v}_{g,1}, \ldots, \mathbf{v}_{g,M}]$. We observe that $N$ does not
affect the dimension of VLAC, but serves as a control variable
for the coarseness of the description. After calculating
$\mathbf{v}_g$ for all $g \in [1, 2, \ldots, G]$, we project them on a trained set
of $D$ principal eigenvectors to perform dimensionality reduction.
We then concatenate these vectors to generate a compact
$G \times D$-dimensional vector $\mathbf{v}_{u_1} = [\mathbf{v}_{u_1,1} \cdots \mathbf{v}_{u_1,G}]$
for video $u_1$. The similarity between two videos $u_1$ and $u_2$ is
given by calculating $s_{u_1,u_2} = \langle \mathbf{v}_{u_1}, \mathbf{v}_{u_2} \rangle$. A threshold is then
applied on $s_{u_1,u_2}$ to determine whether the videos are similar.
If two videos contain a different number of GoFs (e.g.,
$G_1$ and $G_2$ GoFs with $G_2 > G_1$), $s_{u_1,u_2}$ is calculated for all
possible alignments $k$ of the vectors $\mathbf{v}_{u_1}$ and $\mathbf{v}_{u_2}$. Finally, the
maximum over $k$ is taken to be the similarity score. This can
be expressed as

$$s_{u_1,u_2} = \max_{k} \left\{ \sum_{g=1}^{G_1} \left\langle \mathbf{v}_{u_1,g}, \mathbf{v}_{u_2,g+k} \right\rangle \right\}, \quad \forall k \in \{1, 2, \ldots, (G_2 - G_1)\} \tag{6}$$
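The encoding of Eq. (5) and the alignment search of Eq. (6) can be sketched together in NumPy. Two assumptions are made: Q is taken to be nearest-CLFC assignment in Euclidean distance, and the alignment index is 0-based and covers every placement of the shorter video inside the longer one (offsets 0 to G2 - G1), which differs slightly from the 1-based index set printed in Eq. (6). Function names are illustrative.

```python
import numpy as np

def vlac_encode(lfcs, clfcs):
    """Eq. (5): aggregate residuals of a GoF's N LFCs to their nearest CLFC.

    lfcs: (N, F) local feature centers of one GoF; clfcs: (M, F) trained CLFCs.
    Returns a vector of length M * F.
    """
    lfcs = np.asarray(lfcs, dtype=float)
    clfcs = np.asarray(clfcs, dtype=float)
    dists = ((lfcs[:, None, :] - clfcs[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    vlac = np.zeros_like(clfcs)
    # Unbuffered accumulation of each residual into the row of its nearest CLFC.
    np.add.at(vlac, nearest, lfcs - clfcs[nearest])
    return vlac.reshape(-1)

def video_similarity(short_video, long_video):
    """Maximum over alignments of summed per-GoF inner products (cf. Eq. (6)).

    short_video: (G1, D) projected VLACs; long_video: (G2, D) with G2 >= G1.
    """
    short_video = np.asarray(short_video, dtype=float)
    long_video = np.asarray(long_video, dtype=float)
    g1, g2 = len(short_video), len(long_video)
    if g1 > g2:
        raise ValueError("short_video must not have more GoFs than long_video")
    scores = [float(np.sum(short_video * long_video[k:k + g1]))
              for k in range(g2 - g1 + 1)]
    return max(scores)
```

When both videos have the same number of GoFs, only one alignment exists and the score reduces to the plain inner product of the concatenated vectors.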


Examining the performance of VLAC under the experiment
of Section 2.4, we obtain $s_{\text{VLAC},\text{clean},\text{noisy}} = 0.1131$,
which is more than 13 times higher than $s_{\text{VLAD},\text{clean},\text{noisy}}$ and
roughly 2.5 times higher than $s_{\text{HP},\text{clean},\text{noisy}}$.
We therefore expect the proposed method to be significantly
more robust than VLAD and hyper-pooling when assessing
video similarity under noisy conditions. However, in order
to be suitable for video retrieval, it must also be discriminative,
i.e., be able to differentiate between dissimilar videos
that would inherently lead to different features. This is
assessed experimentally in the following section.


4. EVALUATION OF VIDEO DESCRIPTORS 

4.1. Dataset 

We selected 100 random videos from the Open Video Project
(OVP), comprising 1000 minutes of video. Seven types of
distortions (Table 1) were applied to this footage to examine
the performance of VLAD, hyper-pooling (HP) and VLAC
under noise. Training for VLAD, VLAC and HP centers was
done on OVP videos that are not part of the test material.

To generate the queries, one-minute video segments were
extracted from each original video. Then, the dataset and
query videos were sampled at a rate of ^ frames-per-second 
(fps). The sampling of the query videos, however, is shifted 
by 0.25 seconds with respect to the sampling of the videos in 
the dataset. In this way, sampling misalignments were also 
taken into account. First, we evaluate the similarity detection
of the proposed VLAC versus the state-of-the-art VLAD
when both are extracted from each sampled frame in the
sequence (that is, GoF = 1). For VLAD, we set J = 128, while
for VLAC we use N = 128 and M = 16. This provides
an upper bound on the detection accuracy and assesses the
performance of the proposed method versus the standard
per-frame VLAD. Next, the proposed VLAC-GoF is compared
against VLAD-GoF and HP-GoF, where one descriptor per
GoF of 5 frames is derived and the overlap is set to one frame.
Concerning the parameters for each method, we use J = 128
for VLAD-GoF, and N = 256 and M = 16 for VLAC-GoF. For
HP-GoF, the number of centers used to encode the first-stage
VLAD is 128 and the number used for the second stage is 32, where
we keep 512 dimensions from the first-stage VLAD.

4.2. Performance and Results 

Fig. 1 depicts the precision versus the recall achieved with
the proposed VLAC and the state-of-the-art VLAD [10],
when both descriptors are extracted from each of the frames
in the compared video segments. The results show that
the proposed descriptor offers a substantial detection accuracy
improvement compared to VLAD across the entire
precision-recall range. The improved performance of VLAC
can be explained by its improved tolerance to noise, i.e.,
$s_{\text{VLAC},\text{clean},\text{noisy}} > s_{\{\text{VLAD},\text{HP}\},\text{clean},\text{noisy}}$, which indicates that
the principal component projections do not vary substantially
after the application of distortions. Therefore, VLAC retains
more information after being projected on its trained
principal components. Note that the training videos used to
generate the principal components did not have any distortions
applied to them; this is to simulate real-life conditions
where we cannot predict the distortions in the dataset. In
addition, all distortions were applied to all videos in the
dataset, meaning that higher recall reflects higher tolerance to
distortions. The same observations can be made from the results
in Fig. 2, where our VLAC is compared against VLAD and
hyper-pooling for a GoF of size 5 frames.
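For reference, the sketch below computes average precision and its mean over queries using the common textbook definition (precision averaged at the ranks of the relevant items). The paper does not spell out its exact evaluation protocol, so this is an assumption for illustration, and the function names are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average precision of one ranked result list.

    ranked_relevance: booleans, True where the retrieved item is relevant,
                      ordered by decreasing similarity score.
    num_relevant:     total number of relevant items in the database.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / num_relevant if num_relevant else 0.0

def mean_average_precision(per_query):
    """per_query: list of (ranked_relevance, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)
```

For example, a ranked list with relevant items at ranks 1 and 3 (out of two relevant items in total) has an average precision of (1/1 + 2/3) / 2.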


Distortion            Parameters
Scaling               FFMPEG: -vf scale=iw/2:-1
Rotation              FFMPEG: -vf rotate ^
Blurring              FFMPEG: -vf boxblur 1:2:2
Compression           FFMPEG: -crf 35
Gamma Correction      FFMPEG: -vf mp 1:1.2:0.5:1.25:1:1:1
Flicker               OpenCV: Random brightness change (120%-170%)
Perspective Change    OpenCV: AffineTransform triangle [(0,0), (0.85,0.1), (0,1)]

Table 1. Set of distortions applied to the videos in the database.



Fig. 1. Precision versus recall for VLAD [10] and the proposed
VLAC, when extracted from each frame (GoF = 1); (a)
D = 128 and (b) D = 256.

Fig. 2. Precision versus recall for VLAD [10], HP [5], and the
proposed VLAC under GoF = 5 with an overlap of 1 frame;
(a) D = 128 and (b) D = 256.
Method               D      mAP
VLAD [10]            256    0.7462
                     128    0.6761
Proposed VLAC        256    0.9600
                     128    0.9330
VLAD-GoF [10]        256    0.5647
                     128    0.5262
Proposed VLAC-GoF    256    0.7147
                     128    0.6493
HP-GoF [5]           256    0.4382
                     128    0.4135

Table 2. Mean Average Precision (mAP) for VLAD [10], HP
[5] and the proposed VLAC under: frame-by-frame operation
(top two) and GoF-based operation (bottom three).

Table 2 shows the mean average precision (mAP) for the
three compared methods, where D is the number of dimensions
after projection. The results show that, under the same
D, VLAC improves the mAP by 28.65% to 38.00% relative
to VLAD for frame-by-frame matching and by 23.39% to
26.56% for GoF-based matching. The improvement offered
by VLAC-GoF over HP-GoF reaches up to 63.10%.
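The quoted percentages are relative gains, and they can be reproduced from the mAP values of Table 2 with a short calculation. The helper name and the dictionary layout below are illustrative only.

```python
def relative_gain(proposed, baseline):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline

# mAP values copied from Table 2, keyed by (method, D).
MAP = {
    ("VLAD", 256): 0.7462, ("VLAD", 128): 0.6761,
    ("VLAC", 256): 0.9600, ("VLAC", 128): 0.9330,
    ("VLAD-GoF", 256): 0.5647, ("VLAD-GoF", 128): 0.5262,
    ("VLAC-GoF", 256): 0.7147, ("VLAC-GoF", 128): 0.6493,
    ("HP-GoF", 256): 0.4382, ("HP-GoF", 128): 0.4135,
}

gains = {
    (baseline, d): relative_gain(MAP[(proposed, d)], MAP[(baseline, d)])
    for proposed, baseline in [("VLAC", "VLAD"),
                               ("VLAC-GoF", "VLAD-GoF"),
                               ("VLAC-GoF", "HP-GoF")]
    for d in (256, 128)
}
```

For instance, `gains[("VLAD", 256)]` is 100 * (0.9600 - 0.7462) / 0.7462, which rounds to 28.65, and `gains[("HP-GoF", 256)]` rounds to 63.10.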

5. CONCLUSION 

We proposed a novel compact video representation method
based on aggregating local feature centers. Our results show
that encoding local feature centers yields significantly better
results than simply encoding the features, which are less
tolerant to visual distortions commonly found in video databases.
The proposed approach is therefore suitable for video similarity
detection with robustness to visual distortions. The
recall-precision results were improved without incurring extra
complexity in the signature matching process. Future work will
assess the performance of the proposed approach under
uncontrolled distortion conditions and even larger datasets.